Name | Version | Summary | Date |
--- | --- | --- | --- |
superoptix | 0.1.0b8 | Full Stack Agentic AI Framework | 2025-08-02 15:48:45 |
zeroeval | 0.6.8 | ZeroEval SDK | 2025-08-02 06:18:53 |
openjury | 0.1.0 | Python SDK for evaluating multiple model outputs using configurable LLM-based jurors | 2025-08-01 19:36:43 |
python-flexeval | 0.1.5 | FlexEval is a tool for designing custom metrics, completion functions, and LLM-graded rubrics for evaluating the behavior of LLM-powered systems. | 2025-08-01 01:20:35 |
llama-index-packs-rag-evaluator | 0.4.0 | llama-index packs rag_evaluator integration | 2025-07-30 20:54:25 |
dyff-audit | 0.11.1 | Audit tools for the Dyff AI auditing platform. | 2025-07-30 17:35:43 |
agenta | 0.50.3 | SDK for Agenta, an open-source LLMOps platform. | 2025-07-29 17:42:14 |
quotientai | 0.4.6 | Python library for tracing, logging, and detecting problems with AI Agents | 2025-07-29 14:28:52 |
trajectopy | 3.1.2 | Trajectory Evaluation in Python | 2025-07-29 12:42:26 |
dyff-client | 0.18.0 | Python client for the Dyff AI auditing platform. | 2025-07-28 18:51:39 |
pymcpevals | 0.1.1 | Python package for evaluating MCP (Model Context Protocol) server implementations using LLM-based scoring | 2025-07-27 07:17:20 |
mandoline | 0.4.0 | Official Python client for the Mandoline API | 2025-07-26 20:32:40 |
SurvivalEVAL | 0.4.5 | The most comprehensive Python package for evaluating survival analysis models. | 2025-07-26 06:19:12 |
dyff-schema | 0.30.1 | Data models for the Dyff AI auditing platform. | 2025-07-25 17:35:17 |
evalassist | 0.1.20 | EvalAssist is an open-source project that simplifies using large language models as evaluators (LLM-as-a-Judge) of other models' outputs by helping users iteratively refine evaluation criteria in a web-based interface. | 2025-07-25 16:44:14 |
monitoring-rag | 0.0.2 | A comprehensive, framework-agnostic library for evaluating Retrieval-Augmented Generation (RAG) pipelines. | 2025-07-24 11:25:53 |
novaeval | 0.4.0 | A comprehensive, open-source LLM evaluation framework for testing and benchmarking AI models | 2025-07-22 19:20:41 |
evalscope | 0.17.1 | EvalScope: Lightweight LLM Evaluation Framework | 2025-07-21 02:12:56 |
grandjury | 1.0.1 | Python client for the GrandJury server API: collective intelligence for model evaluation | 2025-07-18 05:08:40 |
rag-evaluation | 0.2.2 | A robust Python package for evaluating Retrieval-Augmented Generation (RAG) systems. | 2025-07-17 08:30:01 |
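
As a minimal sketch of how this list might be used locally, the snippet below checks which of the listed distributions are installed in the current environment and compares them against the versions shown in the table. It relies only on the standard-library `importlib.metadata` module; the handful of package names and version strings are taken directly from the table, and nothing is assumed about any of these libraries' own APIs.

```python
# Check which of the listed evaluation packages are installed locally and
# compare installed versions against the versions shown in the table above.
from importlib.metadata import PackageNotFoundError, version

# Example entries copied from the table (name -> listed version).
LISTED = {
    "superoptix": "0.1.0b8",
    "zeroeval": "0.6.8",
    "evalscope": "0.17.1",
    "rag-evaluation": "0.2.2",
}

for name, listed_version in LISTED.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed (listed version: {listed_version})")
    else:
        note = "matches table" if installed == listed_version else f"table lists {listed_version}"
        print(f"{name}: {installed} installed ({note})")
```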